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Abstract 

Background: The presence of tandem amino acid repeats (AARs) is one of the signatures of eukaryotic proteins. 
AARs were thought to be frequently involved in bio-molecular interactions. Comprehensive studies that primarily 
focused on metazoan AARs have suggested that AARs are evolving rapidly and are highly variable among species. 
However, there is still controversy over causal factors of this inter-species variation. In this work, we attempted to 
investigate this topic mainly by comparing AARs in orthologous proteins from ten angiosperm genomes. 

Results: Angiosperm AAR content is positively correlated with the GC content of the protein coding sequence. 
However, based on observations from fungal AARs and insect AARs, we argue that the applicability of this kind of 
correlation is limited by AAR residue composition and species' life history traits. Angiosperm AARs also tend to be 
fast evolving and structurally disordered, supporting the results of comprehensive analyses of metazoans. The 
functions of conserved long AARs are summarized. Finally, we propose that the rapid mRNA decay rate, alternative 
splicing and tissue specificity are regulatory processes that are associated with angiosperm proteins harboring 
AARs. 

Conclusions: Our investigation suggests that GC content is a predictor of AAR content in the protein coding 
sequence under certain conditions. Although angiosperm AARs lack conservation and 3D structure, a fraction of 
the proteins that contain AARs may be functionally important and are under extensive regulation in plant cells. 



Background 

Tandem amino acid repeats (AARs), or homopeptides, 
are protein segments that comprise a continuous array of 
identical residues. As repetitive DNA is very abundant in 
eukaryotic genomes [1], AARs are frequently found in 
the proteomes of eukaryotes [2-4] . These simple peptides 
can be encoded by tandem repeats of the same codon, 
which are vulnerable to point mutations, or by a mixture 
of synonymous codons [5]. These repetitive codon tracts 
are primarily introduced by either replication slippage [6] 
or recombination [7]. 

AARs are often situated in disordered regions of pro- 
teins that lack regular 3D structures [8]. Nevertheless, 
over the past two decades, increasing attention has been 
paid on biological importance of AARs (see [9] for a 
recent review) which have long been regarded as junk 
sequences [10]. AARs have been shown to be associated 
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with several diseases. For example, the expansion of a 
glutamine repeat may induce Huntington's disease and 
other neuro-degenerative diseases [11]. Beneficial effects 
of AARs have also been uncovered. An example is the 
glutamine repeat that appears in a key component of the 
biological clock in the fungus Neurospora crassa White 
Collar- 1. This AAR was suggested to control circadian 
period length [12]. Large-scale analyses indicate that 
AARs tend to participate in the regulation of transcrip- 
tion [8,13,14] and are frequently involved in protein-pro- 
tein interactions [15]. 

AARs are highly polymorphic and fast-evolving 
sequences [9,16]. In line with the accelerated rate of evo- 
lution for protein segments that are in or near AARs, 
selective constraints are thought to be relaxed around 
AARs [8,17]. There are competing interpretations of the 
rapid evolution of AARs. Some believed that AARs 
evolve in a largely neutral fashion [18], partly as a conse- 
quence of the balance between replication slippage and 
point mutations [6] . Based on shifts of the frequency dis- 
tribution of coding tri-nucleotide repeats compared to 
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that of non-coding tri-nucleotide repeats, Mularoni et al 
proposed that selection plays an important role in AAR 
evolution [19]. There is also evidence for positive selec- 
tion on the AARs from case studies of a few mammalian 
genes [20,21]. 

The frequency and size of AARs show inter- and intra- 
species variation both in large-scale comparisons [17] 
and in studies focused on vertebrates [8,13,19] or fruit 
flies [22]. The causal factors underlying this variation are 
still a matter of dispute [13,17], and some have attributed 
them to GC content bias [16,18,23]. In plants, repetitive 
DNA is widely used as a genetic marker, and its variation 
among transcripts has been observed [24] . Nevertheless, 
in contrast to animal AARs, plant AARs have not been 
intensively investigated, except in a recent report based 
on two model plants, Arabidopsis {Arabidopsis thaliana) 
and rice (Oryza sativa) [14]. 

In this report, we revisit the questions surrounding 
AARs in plants in the light of current accumulation of 
plant whole genome sequences. A comparison of 1-to-l 
orthologous proteins between ten sequenced angiosperm 
species revealed a positive correlation between AAR con- 
tent and GC content, a finding that may be applicable to 
some other non-metazoan taxa. Other factors related to 
AAR content variation were also discussed. We attempted 
to summarize the functions of conserved long angiosperm 
AARs and their host genes in the context of the rapid evo- 
lution of AARs and their flanking regions. Our analysis 
also supports the idea that AARs are widely associated 
with protein structural disorder. Finally, we suggest that 
transcripts of repeat-containing proteins (RCPs) are under 
various levels of regulation in plant cells. 

Results and Discussion 

Correlation between AAR content and GC content in 
plants 

We predefined an AAR as an uninterrupted run of four or 
more identical amino acids. Our 1-to-lortholog dataset 
contains 4, 281 groups of proteins from six eudicots[25] 
[26] [27] [28] [29] [30] (A. thaliana, Carica papaya, Glycine 
max, Malus x domestica, Populus trichocarpa and Vitis 
vinifera) and four monocots [31],[32][33][34] (Brachypo- 
dium distachyon, O. sativa, Sorghum bicolor and Zea 
mays). The abbreviations used and general information 
about the genome sequence data of these ten species are 
given in Table 1. All of the following analyses within the 
spectrum of angiosperm were based on this dataset, unless 
stated otherwise. 

Similar to other eukaryotes, angiosperm proteins are 
enriched in AARs (0.84 AAR per protein on average). 
Because short AARs may be derived from the interrup- 
tion of a long AAR, we used repeated residues per 1000 
amino acids (RRPK, Repeated Residues per Kilo Amino 
Acids, defined as the ratio of the total AAR length to the 



protein length, multiplied by 1000) to represent the AAR 
content of a protein or protein segment. For example, 
the RRPK of peptide "QQQQQSTWQQQQAAE" is 91 
15 x 1000 = 600. There is a nearly 3-fold variation in 
RRPK between these ten species, with values that range 
from 5.06 (grape) to 15.25 (rice). It is somehow striking 
that large genomes or proteomes do not necessarily have 
higher RRPK. 

We noticed that orthologs in monocots have an elevated 
RRPK and that the GC content (of their coding sequences) 
is also higher. A strong linear correlation between RRPK 
and GC content can be observed (Pearsons correlation 
coefficient, r = 0.87, p = 1.1 x 10" 3 ; Figure 1A), although 
no positive correlations observed within eudicots or 
monocots, partially due to a limited taxonomy coverage of 
the available genomes. To test whether this phenomenon 
is specific to angiosperms, we also tested a set of 1-to-l 
orthologs between Arabidopsis, moss (Physcomitrella 
patens [35]) and green algae (Chlamydomonas reinhardtii 
[36] and Volvox carteri [37]) and also observed a strong 
positive correlation (r = 0.97, p = 2.5 x 10" 2 ; Figure IB). 
Moreover, within each angiosperm species, a weak but sig- 
nificant positive correlation between protein RRPK and 
GC content was observed (r = 0.15-0.43, p < 1.0 x 10" 10 ; 
Figure 1C and ID). Finally, when protein sequences were 
equally divided into three parts, an accumulation of AARs 
in the N-terminus could be observed, which is in line with 
the elevated GC content of this region (Figure IE). How- 
ever, the C-terminus also had a higher RRPK in compari- 
son with the middle segments for all of the species except 
rice (Figure IE), similar to what has been observed in ani- 
mals [16,22]. It is worth mentioning that AARs have been 
proposed to follow a negative gradient from the N-termi- 
nus to the C-terminus in plant proteins [14]. One explana- 
tion for this discrepancy is that there are differences in the 
definition of the protein terminus between the two studies. 
Zhang et al. measured the absolute position of the AARs 
[14] and thus may have counted all of the repeats that are 
present in short proteins as being in the N-terminus. 

High GC content favors replication slippage and, thus, 
the generation of AARs, which has been proposed in a 
number of reports [13,16,38]. On the other hand, GC 
content has also been treated as an indicator of the local 
recombination rate [39]. Angiosperm species with high 
recombination rates (i.e., exactly, higher average centi- 
Morgan per megabase), such as Arabidopsis and rice 
[40], are relatively enriched for AARs (Figure 1A). We 
attempted to test the association of RCPs with recombi- 
nation hotspots in Arabidopsis by exploiting publicly 
available extensive SNP data [41]. A total of 293 putative 
hotspot neighboring genes (see Materials and Methods) 
were identified. At the whole proteome level, the fraction 
of hotspot neighboring genes in genes encoding RCPs is 
higher than that in genes not encoding RCPs in our 
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Table 1 Summary of the ten angiosperm genomes included in this study 



Organism 



Abbreviation 



Genome size (Mbp) 



Number of proteins 



Reference 



Eudicot 

Arabidopsis tholiono 
Corico popoyo 
Glycine mox 
Molus x domestico 
Populus trichocorpo 
Vitis vinifero 
Monocot 

Brachypodium distachyon 
Oryza sativa 
Sorghum bicolor 
Zea mays 



Arabidopsis 
papaya 
soybean 
apple 

cottonwood 
grape 

false brome 
rice 

sorghum 
maize 



120 
372 

1, 100 
742 
550 
500 

272 
382 
735 

2, 500 



27, 1692 
27, 181 
46, 260 
62, 997 
40, 664 

26, 092 

25, 525 
56, 795 

27, 561 
32, 606 



25 

[26] 

[27] 

[28] 

[29] 

[30] 

[32] 
[31] 
[33] 
[34] 



The genome size is the total length of chromosomal sequences. A gene with multiple protein products was counted as one protein in this table. 



dataset (1.3% vs. 0.93%, Fishers exact test, p = 0.01), indi- 
cating that the influence of recombination on the AAR 
frequency, although limited, cannot be excluded. 

The relationship between AAR content and GC content is 
distorted by AAR composition and life history traits 

There is controversy over the relationship between GC 
content and AAR content. A positive correlation between 
GC content and AAR content has been observed in some 
mammalian species [16], while in a wider spectrum of 
taxa, a negative correlation was proposed [18]. Thus, we 
examined the relationship between GC content and AAR 
content within two additional taxonomic groups from 
distinct eukaryotic clades, Sordariomycetes fungi and 
Diptera insects (summarized in Table SI and Table S2 in 
Additional File 1, respectively). These 1-to-l ortholog 
groups contain 4, 047 and 3, 680 proteins from fungi and 
insects, respectively. 

A positive correlation was observed in fungi (r = 0.78, p 
= 2.2 x 10~ 2 ; Figure SI in Additional File 1). Both plants 
and fungi harbor a large fraction of AARs that are encoded 
by GC-rich codons (39.3% and 36.9% on average, respec- 
tively; see also Figure 2), including alanine, glycine and 
proline repeats. Removal of these three types of AAR 
would diminish the positive correlation (p = 0.19 and 
3.6 x 10" 2 for plants and fungi, respectively). In contrast, 
fruit flies, whose RRPK shows a negative correlation with 
GC content (r = -0.72, p = 8.9 x 10" 3 , Figure S2 in Addi- 
tional File 1), were relatively enriched for glutamine 
repeats (22.1% on average) but not for types of AAR that 
were encoded by GC-rich codons (31.7% on average), indi- 
cating the influence of AAR composition on the relation- 
ship between AAR content and GC content. Additionally, 
no significant correlation was identified in the insect 
group as a whole because mosquito proteins, which con- 
tain fewer glutamine repeats (15.9% on average) than their 



fruit fly orthologs, seem to accumulate AARs with elevated 
GC content (Figure S2). 

However, GC content and residue composition are not 
the only factors that influence the AAR content. For 
example, both maize and grape have relatively higher GC 
content (Figure 1A) and fewer glutamine repeats (< 4.5%), 
but their RRPK are the lowest among eudicots and mono- 
cots, respectively (Figure 1A). These two species share at 
least two life-history traits: (1) relatively "large body size" 
and (2) cross-pollination. We intentionally used quotes in 
this paragraph to emphasize that, owing to the high plasti- 
city of plant development, caution should be used when 
linking body size to genomic signatures. Conversely, self- 
pollinating "small grasses", such as Arabidopsis and rice, 
have abundant AARs. Rice orthologs are so abundant in 
AARs that they appear as an outlier in the linear regres- 
sion (Grubbss test, p = 0.015; Figure 1A). A recent survey 
[24] showed that barley {Hordeum vulgare) has a higher 
fraction of RCPs than sugarcane {Saccharum officinarum). 
Interestingly, the former is a self-pollinating "small grass", 
whereas the latter is a cross-pollinating "large grass". In all, 
life-history traits are seeming cofactors of AAR content 
and deserve re-examination when more angiosperm gen- 
ome sequences become available. 

Taken together, the driving force shaping the presence 
and content of different types of AARs or low- complexity 
sequences appear to be complex, as was recently suggested 
for Plasmodium falciparum [42]. The interplay between 
GC content, AAR residue composition and life-history 
remains complicated and needs further investigation. 

Rapid evolution of angiosperm AARs and their functions 

Like their animal counterparts [8], many angiosperm AARs 
have not been conserved over a long period of evolution. 
This trend is indicated by the observation that approxi- 
mately 75% of AARs fail to align to the corresponding 
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Figure 1 Correlation between AAR and coding GC content in plants. (A) Among angiosperm orthologs (the abbreviations used are detailed 
in Table 1), the average values are shown; (B) among Arabidopsis, moss and green algae orthologs, the average values are shown; (C) for all 
angiosperm orthologs from Arabidopsis; (D) for all angiosperm orthologs from rice; (E) in different regions of orthologous angiosperm proteins. 



region in any of the other orthologs (i.e., the corresponding average dN/dS ratio of the AAR flanking regions in corn- 
regions in the multiple alignment of other orthologs are parison to RCPs as a whole, was also observed (Mann- 
filled with gaps). A faster rate of evolution, as estimated by Whitney U test, p < 1 x 10" 9 for all species; Table S3 in 
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Figure 2 Residue composition in AARs and across the entire set of orthologs. The fractions of residues from AARs and from the entire set 
of orthologs are shown as green columns and yellow columns, respectively. Only positive error bars are shown. 



Additional File 2), supporting previous work that was con- 
ducted in other species [17,43]. Although it has been sug- 
gested that purifying selection is relaxed in flanking regions 
of AARs, only about 3% of these flanking regions show 
signs of positive selection, i.e., a dN/dS greater than 1 
(Table S4 in Additional File 2). Assuming that the fraction 
of regions under positive selection would be underesti- 
mated by the average dN/dS, we also calculated pairwise 
dN/dS for three pairs of species: (1) Arabidopsis and 
papaya, (2) rice and false brome and (3) maize and sor- 
ghum. The fractions were still limited (Welch's t-test, p > 
0.05; Table S4), indicating that positive selection is not a 
ubiquitous evolutionary process in AAR flanking regions. 

Some pathogenesis-related proteins from P. falciparum 
were enriched with long asparagines repeats, and this phe- 
nomenon has been proposed as a reflection of selection 
against human immune systems [44]. We tested the func- 
tion enrichment of rice RCPs with higher (> 20) RRPK, 
but did not find specific terms other than the regulation of 
transcription (detailed data not shown). Even though some 
molecular functions are enriched for RCPs [14], a loss of 
conservation can make the detailed genetic or biochemis- 
try assays of the function of the AARs difficult. Fortu- 
nately, a few conserved long AARs still exist and can serve 
as targets for further experimental analysis. We chose con- 
served long AARs based on two criteria: (1) longer than 
seven residues and (2) the corresponding regions from at 
least eight other orthologs could be aligned with a multiple 
sequence alignment identity that is not lower than 50%. 
We then mapped these 18 AARs onto the corresponding 
regions in the Arabidopsis orthologs (Table 2 and Table 
S5 in Additional File 2) and found a few with indirect evi- 
dence [45] [46] [47] [48] [49] of being functionally important 
in Arabidopsis (Table 2). For example, truncation of the 
N-terminal acid domains containing long serine repeats 
from ABI3 (ABA INSENSITIVE 3) largely abolishes its 



activity [45]. Similarly, truncation of a domain with a glu- 
tamine rich region in SEU (SEUSS) can cause severe devel- 
opmental defects [46]. Nonetheless, the functions of most 
of these 18 AARs and even the functions of their host 
genes remain to be surveyed (Table S5). 

Angiosperm AARs tend to be structurally disordered and 
regulated at the transcript level 

One explanation for the rapid evolution of AARs is that 
AARs tend to be disordered and thus lack structural con- 
straints during evolution [50], an idea recently stressed by 
Simon and Hancock [8]. Indeed, disorder-promoting resi- 
dues [51] such as serine, alanine, glutamine, glycine and 
proline are overrepresented in angiosperm AARs in com- 
parison with the entire set of orthologs (Mann- Whitney 
U test, p < 0.01; Figure 2). To test whether angiosperm 
AARs tend to be fully disordered (i.e., embedded in disor- 
dered regions), we used PONDR® VSL2B [52] and 
IUPred [53] to predict disordered regions in our set of 
orthologs. These two software packages take advantage of 
distinct features and strategies to predict ordered/disor- 
dered status for each residue. PONDR® VSL2B and 
IUPRED disagree about the absolute fraction of fully dis- 
ordered AARs (Table 3). In an independent benchmark- 
ing test with default cutoffs [54], IUPRED achieved 59.5% 
sensitivity with 95.6% specificity; the sensitivity of VSL2 
reached 75.5%, but its specificity was 79.4% (the sensitiv- 
ity of VSL2B was shown to be approximately 5% lower 
compared with VSL2 at a similar specificity [52]). Thus, 
this disagreement reflects the different trade-offs between 
false positives and false negatives that are inherent to dif- 
ferent predictors. We tested whether the tendency of 
AARs to be fully disordered was significant by counting 
how many times, in 1000 trials, randomly selected equal 
length segments in RCPs showed a higher fraction of fully 
disordered segments. None of the 1000 random trials in 
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Table 2 Putative functions for conserved long AARs in Arabidopsis 



Protein 


Type 


Description (gene/AAR) 


Clue 


AT3G24650.1 


S 


ABB is a key component of the ABA signal transduction pathway ./See text for details of AAR function. 


Biochemical 
[45] 

Phenotypic [46] 


AT1G43850.1 


Q 


SEU, together with LUG {LEUNIG), controls the development of several organs/See text for details of AAR 
function. 


AT4G32551.1 


Q 


LUG, see above/The AAR is thought to be involved in the assembly of transcriptional co-repressors. 


Speculation 
only [47] 


AT5G67470.1 


P 


FH6 {FORMIN H0M0L0G 6) binds profilin and is involved in actin-nucleating activity /The AAR may directly 
contribute to its binding activity. 


Speculation 
only [48] 


AT1G25540.1 


Q 


PFT1 {PHYTOCHROME AND FLOWERING TIME 7) is a transcription factor that controls the flowering time/The 
AAR may be involved in transcriptional activation. 


Speculation 
only [49] 



any of the 10 species resulted in a higher fraction of fully 
disordered segments, suggesting that the fraction of fully 
disordered AARs is indeed significantly high (i.e., p < 
0.001). 

The tendency of AARs to be disordered indicates that 
RCPs may be under extensive regulation in plant cells, 
to prevent them from inducing dosage-sensitive pheno- 
types, as protein structural disorder was suggested to be 
associated with dosage-sensitive phenotypes in model 
metazoans [55]. The first line of evidence comes from 
the observation that transcripts encoding RCPs decay 
more quickly in Arabidopsis, similar to the transcripts 
of disordered proteins in human [56]. Both within 
orthologs and at the whole proteome scale, transcripts 
of RCPs have shorter half-lives than the rest (Welch's t- 
test, p < 2.2 x 10~ 16 ; Figure 3A). A similar result was 
obtained by comparing RCPs with non-RCPs encoded 
by GC-rich (GC content not smaller than 0.45) coding 
sequences (Welch's t-test, p < 2.2 x 10" 16 ). Second, in 
the whole proteomes of Arabidopsis and rice, the frac- 
tions of alternatively spliced genes (i.e., genes with mul- 
tiple gene models) were higher for RCPs than for non- 
RCPs (17.7% vs. 15.9% and 12.1% vs. 10.9%, respectively; 
Fisher exact test, p = 0.021 and 7.4 x 10" 3 , respectively). 
Moreover, protein segments encoded by alternatively 

Table 3 Fraction of fully disordered AARs 



Abbreviation 


VSL2B Fraction 


lUPred Fraction 


Arabidopsis 


82.5% 


48.0% 


papaya 


77.8% 


41.6% 


soybean 


78.6% 


46.3% 


apple 


67.9% 


45.0% 


cottonwood 


79.6% 


45.5% 


grape 


73.5% 


39.7% 


false brome 


80.2% 


44.6% 


rice 


81.5% 


45.6% 


sorghum 


80.2% 


44.9% 


maize 


78.4% 


42.4% 



All fractions of fully disordered AARs were significantly high in comparison 
with random samples (p < 0.001). 
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Figure 3 Regulation of RCPs at the transcript level (A) 

Comparison of mRNA half lives of RCPs (green boxes) and non- 
RCPs (yellow boxes) in Arabidopsis. Large outliers (> 50 h) were not 
shown. (B) Comparison of the tissue specificity index of RCPs with a 
single gene model (green boxes) and non-RCPs with a single gene 
model (yellow boxes) in Arabidopsis and rice orthologs. 
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spliced exons showed significantly higher repeat content 
than those encoded by constitutively spliced exons 
(Welch's t-test, p = 1.1 x 10" 28 and 5.1 x 10" 47 for Ara- 
bidopsis and rice, respectively; Table S6 in Additional 
File 2), which further supports the association between 
AARs and alternative splicing that has been proposed by 
Haerty and Golding based on their observations in 
metazoans [57]. For Arabidopsis orthologous RCPs and 
rice orthologous RCPs, a higher average RRPK of pro- 
tein segments encoded by alternatively spliced exons 
was also observed (Welch's t-test, p = 7.6 x 10" 6 and 4.6 
x 10" 11 , respectively; Table S7 in Additional File 2), 
while we found no higher fraction of RCPs to be alter- 
natively spliced genes in comparison with non-RCPs 
(23.2% vs. 24.3% and 31.3% vs.37.7%, respectively; Fisher 
exact test, p = 0.31 and 7.7 x 10" 3 , respectively). A third 
regulatory process proposed here is tissue-specific 
expression, as measured by the tissue specificity index 
(see Materials and Methods), may be another comple- 
mentary regulatory process. Orthologous RCPs from 
both Arabidopsis and rice had a relatively higher tissue 
specificity index than non-RCPs (Welch's t-test, p = 2.0 
x 10" 4 and 1.4 x 10" 15 , respectively). If only genes with a 
single model were considered, the difference seemed to 
be limited but could still be observed (Welch's t-test, 
p = 1.5 x 10" 2 and 5.9 x 10" 11 , respectively), indicating 
that tissue-specific expression may partially complement 
alternative splicing in the regulation of RCPs. 

Conclusions 

Angiosperm proteins are enriched in AARs whose content 
is positively correlated with the GC content in the coding 
sequences. It has also been suggested that the correlation 
between AAR content and GC content is influenced by 
residue composition of AARs as well as life-history traits. 
Similarly to AARs in many sequenced eukaryotic species, 
angiosperm AARs evolve rapidly and tend to be disordered. 
Although AARs are usually not well conserved, we identi- 
fied 18 conserved long AARs for further detailed analysis. 
As potentially promiscuous molecules, RCPs are under at 
least three putative transcript-level regulatory controls in 
plant cells, including faster transcript decay, alternative 
splicing and tissue specificity of gene expression. 

Methods 

Collection of sequences 

Sequences of A. thaliana (Version 9) and O. sativa (Ver- 
sion 6.1) were downloaded from TAIR [58] and RGAP 
[31], respectively. All of the other plant sequences were 
downloaded from the Phytozome 6.0 database [59]. The 
sources of the non-plant genome sequences are summar- 
ized in Tables SI and S2 in Additional File 1. For genes 
with multiple protein products (gene models), only the 



representative one (if available) or the longest one was 
retained. 

To search for 1-to-l orthologs between species within 
certain taxonomic groups, InParanoid 4.1, one of algo- 
rithms with the lowest false-positive rates [60], was initially 
employed to identify pair-wise orthologs between the 
reference proteomes (A. thaliana for plant species, N. 
crassa for fungus species and Drosophila melanogaster for 
insects, excluding proteins encoded by the mitochondrial/ 
chloroplast genomes) and the proteomes of the other spe- 
cies, with a score cutoff of 40. Sets of 1-to-l orthologs 
found in all of the species within each group were 
obtained by collecting the intersection of the ortholog 
pairs. 

AAR identification, GC content calculation and statistical 
tests 

We used in-house PERL scripts to collect data on the 
length, composition and position of AARs in protein 
sequences and to calculate the GC content. All statisti- 
cal tests were implemented in R 2.12.1 [61]. 

Recombination hotspots 

We deduced the recombination hotspot at the Arabidopsis 
genome from SNP data described in [41] . Informative SNP 
markers in a chromosome were selected by the TAGGER 
application in HaploView 4.1 [62] with "-maxDistance 20 
-aggressiveTagging -tagrsqeutoff 0.8" options, excluding 
SNPs identified as "N" in more than three out of 20 Arabi- 
dopsis accessions. Mainly due to the greedy marker selec- 
tion approach of TAGGER along the whole chromosomes, 
the total number of informative SNP markers selected 
here is 60, 904. The hotspots were searched in 40-marker- 
long sliding windows by PHASE 2.1.1 [63], with "-MR1 1 - 
X10" options. These windows moved 20 markers per step. 
Windows that were longer than 100 kb were discarded. A 
hotspot was defined as a two-marker interval with a Bayes 
Factor that was higher than 10 in comparison with the 
background recombination rate [41]. Positions of the hot- 
spot were compared with the position of genes in the Ver- 
sion 8 genome to collect genes that overlap with 
recombination hotspots. We called these genes putative 
hotspot neighboring genes (gene IDs were transferred to 
Version 9 for obsolete loci). We do not use Version 9 gen- 
omes here because approximately half of the markers fail 
to map to this version of genome [58]. 

dN/dS calculation 

The alignment of orthologous coding sequences was 
guided by a multi-protein sequence alignment that was 
generated by MAFFT 6.849 [64]. A flanking region was 
defined as 33 amino acids on both sides of an AAR; this 
region was truncated if the end of a protein was reached 
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or if there was an adjacent AAR closer within 33 amino 
acids. PAML 4.3 ynOO tool [65] was used to calculate 
the dN/dS ratio with default parameters. 

Disordered region prediction 

PONDR® VSL2B [52] and IUPred [53] were used for 
predictions of disorder, with default parameters. We did 
not use PONDR® VSL2 because of limitations in com- 
putational capability. We used default thresholds (0.5) to 
predict disordered residues for both predictors. 

Probing transcript level regulation features of RCPs 

AtGenExpress [66] data (Accession: GSE5630, GSE5631, 
GSE5632, GSE5633 and GSE5634) were downloaded 
from the GEO database [67]. Data not derived from the 
Columbia-0 ecotype were discarded. A developmental 
time series of rice transcriptome data [68] (Accession: 
GSE13988, GSE14298, GSE14299 and GSE14300) was 
also downloaded from GEO [67]. Probes were mapped 
to loci in Arabidopsis and rice according to the mapping 
files that were provided by TAIR [58] and the Rice 
Array Database [69], respectively. All of the expression 
values that were presented above background (labeled as 
"Presence") were used and log-transformed. The expres- 
sion values were normalized by subtracting the average 
expression value of a tissue and then adding the average 
expression value of the whole dataset. The tissue specifi- 
city index [70,71] was calculated as follows: 

Ed-f) 

Tissue Specificity Index = — — ^ max 

where Xi is the expression value in the i th tissue, x max 
is the highest expression value among all of the tissues 
and N is the total number of tissues. For loci with mul- 
tiple probes, the average tissue specificity index was 
used. 

To calculate the RRPK of the protein segments that 
were encoded by different types of exons, the protein 
sequences of RCPs were mapped to their exon 
sequences using our in-house PERL scripts. Only exons 
that encode proteins were retained for calculation. The 
mRNA half-life for each probe was obtained from [72]. 
For loci with multiple probes, the average mRNA half- 
life was used. 

Additional material 



Additional file 1: The relationship between AAR content and GC 
content in fungi and insects. This file contains two tables (Tables SI 
and S2) that list the sources of the fungi and insect genome sequences 
that were used in this study, as well as two figures (Figures S1 and S2) 
that show the corresponding relationship between AAR content and GC 
content. 



Additional file 2: Fast evolving AAR flanking regions, conserved 
long AARs of unknown function and RRPK of alternatively spliced 

exons. This file contains Tables S3-S7. 
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AAR: (Tandem) Amino Acid Repeat; RCP: Repeat Containing Protein; RRPK: 
Repeated Residues per One Kilo Amino Acids; SNP: Single Nucleotide 
Polymorphism. 
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